Bayesian Inference - Introduction with Applications

Felipe Angelim

About me

Felipe Angelim

Tech Lead @ Mercado Libre

Core Dev @ Sktime

Creator/Dev @ Prophetverse

felipeangelim.com

Agenda

  1. Motivation
  2. Bayes: Priors, posteriors, and likelihoods
  3. Bayesian Inference applications

Motivation

  • A single point estimate is not enough: we want to quantify uncertainty.
  • Small data requires regularization: Bayesian methods let us encode domain knowledge through priors.

Priors, Posteriors, and Likelihoods

Key ingredients

Priors, Posteriors, and Likelihoods

\[ \underbrace{P(\theta|X)}_{\text{Posterior}} = \frac{\overbrace{P(X|\theta)}^{\text{Likelihood}} \; \overbrace{P(\theta)}^{\text{Prior}}}{\underbrace{P(X)}_{\text{Evidence}}} \implies \quad \quad P(\theta|X) {\LARGE \propto} P(X|\theta) P(\theta) \]

Inference

  • We only know how to compute \(f_X(\theta) = P(X|\theta)P(\theta)\), which is proportional to the posterior.
  • Inference methods:
    1. Sampling methods: e.g., Markov Chain Monte Carlo (MCMC).
    2. Variational inference: approximate the posterior with a simpler distribution.
    3. Maximum a Posteriori (MAP): find the mode of the posterior distribution.

Inference

Markov Chain Monte Carlo (MCMC)

  • Markov Chains: sequences of random states, with transition probabilities between them.
  • Under certain conditions, a chain has a stationary (“equilibrium”) distribution.
  • Idea: build a chain whose equilibrium is the posterior distribution of interest.
  • Monte Carlo then draws samples from this target distribution by running the chain.
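As a minimal sketch of the idea (not how NUTS works internally), a random-walk Metropolis sampler needs only the log-density up to a constant; the step size and target below are illustrative assumptions:

```python
import numpy as np

def metropolis(log_target, n_samples, step=1.0, init=0.0, seed=0):
    """Random-walk Metropolis: sample from a density known only up to a constant."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x = init
    lp = log_target(x)
    for i in range(n_samples):
        proposal = x + step * rng.normal()
        lp_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x))
        if np.log(rng.uniform()) < lp_new - lp:
            x, lp = proposal, lp_new
        samples[i] = x
    return samples

# Target: standard normal, via its log-density up to a constant
draws = metropolis(lambda x: -0.5 * x**2, n_samples=20_000)
```

After discarding warm-up draws, the sample mean and standard deviation approach those of the target distribution.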

Inference

Markov Chain Monte Carlo (MCMC)

Simple linear regression

\[ \alpha \sim N(0, 10) \\ \beta \sim N(0, 10) \\ \sigma \sim \text{HalfCauchy}(5) \\ Y_i \sim N(\alpha + \beta X_i, \sigma) \]

import numpyro
from numpyro import distributions as dist

def linear_regression_model(x, y=None):
    
    # Priors on intercept and slope
    alpha = numpyro.sample("alpha", dist.Normal(0.0, 10.0))
    beta = numpyro.sample("beta", dist.Normal(0.0, 10.0))

    # Prior on noise scale (sigma > 0)
    sigma = numpyro.sample("sigma", dist.HalfCauchy(5.0))

    # Likelihood
    mean = alpha + beta * x
    with numpyro.plate("data", x.shape[0]):
        numpyro.sample("obs", dist.Normal(mean, sigma), obs=y)

Simple linear regression

from numpyro.infer import MCMC, NUTS
import jax.random as random

# Random key for reproducibility
rng_key = random.PRNGKey(0)

kernel = NUTS(linear_regression_model)
mcmc = MCMC(kernel, num_warmup=1000, num_samples=2000)
mcmc.run(rng_key, x_data, y_data)
posterior_samples = mcmc.get_samples()

Bayesian Neural Networks

  • Extend the idea of “random” parameters to a neural network’s weights.
  • Place priors on the weights and sample from their posterior distribution.
  • E.g., \(W \sim N(0, I)\)

Inference

Variational Inference

  • We can also accept that the true posterior may be really hard to compute, and approximate it with a simpler distribution.
  • Search for \(q \in Q\) that minimizes the Kullback-Leibler divergence \(D_{KL}(q \,\|\, p(\cdot | X))\) to the true posterior.
  • Variational inference (VI) provides a fast and approximate solution to the problem.
  • Nowadays, libraries provide automatic Stochastic Variational Inference (SVI).
  • Usually the choice for Bayesian Neural Networks and large datasets.

Inference

Maximum A Posteriori (MAP)

  • Find the mode (maximum) of the posterior distribution.
  • Not full Bayesian inference, but useful for fast estimation and regularization.

\[ \arg\max_\theta P(\theta | X) \]
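A minimal worked example of MAP, using an illustrative coin-flip model with a Beta prior: numerically maximizing the (unnormalized) log-posterior recovers the closed-form posterior mode. The data and prior parameters here are assumptions for the demo:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Coin-flip example: Binomial likelihood, Beta(2, 2) prior
heads, n = 7, 10
a, b = 2.0, 2.0

def neg_log_posterior(theta):
    # -log P(theta | X) up to a constant: -(log-likelihood + log-prior)
    return -(heads * np.log(theta) + (n - heads) * np.log(1 - theta)
             + (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta))

res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_map = res.x

# Closed-form mode of the Beta(a + heads, b + n - heads) posterior
closed_form = (a + heads - 1) / (a + b + n - 2)
```

Because the Beta prior is conjugate here, the optimizer's answer can be checked exactly against the analytic mode.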

Inference

Maximum A Posteriori (MAP)

  • You are already using Bayes!
  • Ridge and Lasso regression can be derived from a Bayesian perspective.
  • Bayesian inference offers a more interpretable view of regularization.

Ridge regression:

\[ Y|X = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I) \\ \hat{\beta} = \arg\min_\beta \left\{ ||y - X\beta||^2 + \lambda ||\beta||^2 \right\} \]

Bayesian Ridge regression:

\[ Y | X \sim N(X\beta, \sigma^2 I) \\ \beta \sim N(0, \frac{I}{\tau^2}) \\ \hat{\beta} = \arg\max_\beta P(\beta | X, Y) \]
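A numerical sanity check of this equivalence (synthetic data assumed): the penalized least-squares solution and the MAP estimate under a zero-mean Gaussian prior solve the same normal equations, with the prior precision relative to the noise variance playing the role of \(\lambda\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

lam = 1.0
# Ridge: closed-form solution of the penalized least-squares problem
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP under y|X,b ~ N(Xb, sigma^2 I) and b ~ N(0, (sigma^2 / lam) I):
# setting the log-posterior gradient to zero gives the same normal equations
sigma2 = 1.0
beta_map = np.linalg.solve(X.T @ X / sigma2 + (lam / sigma2) * np.eye(3),
                           X.T @ y / sigma2)
```

The two solutions agree to numerical precision, which is the sense in which ridge regression "is" a Bayesian MAP estimate.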

You are already using Bayes

\[ P(\beta) \]

Applications on timeseries

Total Addressable market (TAM) estimation

  • Problem: Estimate the total number of potential users/customers in a market.
  • Approach: Use a Bayesian model to estimate the growth of users over time, incorporating prior knowledge about market saturation.

\[ G(t) = \frac{C_1(t-t_0) + C_2}{\left(1 + \exp(-\alpha v (t - t_0))\right)^{\frac{1}{v}}} \]

\[ C_2 \in \mathbb{R}_+ \text{ is the constant capacity term} \\ C_1 \in \mathbb{R}_+ \text{ is the linear growth rate of the capacity} \\ t_0 \in \mathbb{R} \text{ is the time offset term} \\ v \in \mathbb{R}_+ \text{ determines the shape of the curve} \\ \alpha \in \mathbb{R} \text{ is the rate} \]
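A direct sketch of this growth curve in code (the parameter values below are illustrative assumptions, not fitted estimates):

```python
import numpy as np

def tam_growth(t, C1, C2, t0, alpha, v):
    """Generalized logistic growth with a linearly increasing capacity C1*(t - t0) + C2."""
    capacity = C1 * (t - t0) + C2
    return capacity / (1.0 + np.exp(-alpha * v * (t - t0))) ** (1.0 / v)

t = np.linspace(0.0, 10.0, 100)
curve = tam_growth(t, C1=0.5, C2=100.0, t0=5.0, alpha=1.0, v=1.0)
```

In a Bayesian treatment, priors are placed on \(C_1, C_2, t_0, \alpha, v\) (e.g., encoding beliefs about market saturation) and their posterior is inferred from observed user counts.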

Total Addressable market (TAM) estimation

Marketing Mix Modeling

  • Problem: Estimate the impact of different marketing channels on sales.
  • Approach: Use Bayesian models to:
    • Incorporate prior knowledge about the effectiveness of each channel.
    • Estimate the posterior distribution of the impact of each channel on sales.
    • Incorporate A/B testing results to refine estimates.

\[ E[\text{Sales} \mid \text{Marketing Channels}] = \text{trend} + \text{seasonality} + f_{\text{social\_media}}(x_{\text{social\_media}}) + f_{\text{email}}(x_{\text{email}}) + f_{\text{tv}}(x_{\text{tv}}) \]
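One common (but here purely illustrative) choice for the channel responses \(f\) is a saturating Hill curve, which captures diminishing returns on spend; all function names and parameter values below are assumptions for the sketch:

```python
import numpy as np

def hill_saturation(x, half_sat, slope):
    """Diminishing-returns response: 0 at x = 0, approaching 1 as spend grows."""
    return x**slope / (half_sat**slope + x**slope)

def expected_sales(trend, seasonality, spends, params):
    """Additive MMM mean: trend + seasonality + sum of saturated channel responses."""
    total = trend + seasonality
    for channel, x in spends.items():
        half_sat, slope, scale = params[channel]
        total = total + scale * hill_saturation(x, half_sat, slope)
    return total

spends = {"social_media": 2.0, "email": 1.0, "tv": 5.0}
params = {"social_media": (1.0, 1.0, 10.0),
          "email": (1.0, 1.0, 5.0),
          "tv": (3.0, 2.0, 20.0)}
sales = expected_sales(trend=100.0, seasonality=5.0, spends=spends, params=params)
```

In the Bayesian version, the per-channel parameters get priors (informed by channel knowledge or A/B tests) and their posteriors quantify how certain we are about each channel's contribution.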

Other applications

  • Forecasting for Inventory Management: Estimating the probability of stock-outs and optimal reorder points.
  • Censored Data Analysis: (e.g., survival analysis in medicine, reliability engineering)
  • A/B Testing: Quantifying \(P(\text{Variant A > Variant B})\) and the magnitude of difference.
  • Hierarchical Models: Sharing information between groups (e.g., price elasticity across different products/regions, user behavior in different cohorts).
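For the A/B testing case, \(P(\text{Variant A} > \text{Variant B})\) falls out directly by sampling from the two posteriors. A minimal sketch with conjugate Beta-Binomial updates; the conversion counts are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed conversions: variant A 120/1000, variant B 100/1000
# Beta(1, 1) prior  =>  Beta(1 + successes, 1 + failures) posterior
post_a = rng.beta(1 + 120, 1 + 880, size=100_000)
post_b = rng.beta(1 + 100, 1 + 900, size=100_000)

# Monte Carlo estimates of P(rate_A > rate_B) and the expected lift
p_a_better = (post_a > post_b).mean()
expected_lift = (post_a - post_b).mean()
```

Unlike a p-value, `p_a_better` is a direct probability statement about the variants, and `expected_lift` quantifies the magnitude of the difference.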

Conclusion

  • Priors act as regularization.
  • Inference comes in many flavors: MCMC, Variational Inference, MAP.
  • Probabilistic Programming Languages (PPLs) make it easy to implement complex models.
  • Provide rich uncertainty quantification.
  • Natural way to incorporate domain knowledge through priors.

Thank you!

  • Join the sktime and Prophetverse Discord channels!

Extras

Motivation: Why Bayesian?

Which of these is a Bayesian statement, and which is Frequentist?

A. There is a 95% probability that the true value \(\theta\) lies in my interval \([A, B]\).

B. There is a 95% chance that my interval \([A, B]\) contains the true quantity \(\theta\).

Here, \([A, B]\) is an interval generated by a model, and \(\theta\) is the parameter of interest.

Motivation: Interpreting Intervals

Answer:

A. (Bayesian): “There is a 95% probability that the true quantity \(\theta\) lies in \([A, B]\).” Treats \(\theta\) as random and the data as fixed: a probability statement about the parameter.

B. (Frequentist): “There is a 95% chance that \([A, B]\) contains the true quantity \(\theta\).” Treats \(\theta\) as fixed and the data (and thus the interval) as random: a statement about the procedure; if repeated many times, 95% of such intervals would capture the true \(\theta\).

Visualizing the Difference

Adapted from Jake VanderPlas. (Link to video).

Motivation

Principled Regularization

  • Priors act as a natural way to regularize models and prevent overfitting.
  • For example, Ridge and Lasso regression correspond to MAP estimation with Gaussian and Laplace priors, respectively.